FPGA-Based, Self-Checking, Fault-Tolerant Computers 

No software support and little hardware support would be needed for fault tolerance. 
The Memory System would store the states of computations at checkpoint intervals. Upon detection 
of an error in the self-checking processor module, the memory system would provide the information 
needed to roll back the computations to the immediately preceding checkpoint. 


A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components.

A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software.

In comparison with other fault-tolerant-computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware.

There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache and local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.)
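
The comparison of lock-step CPU outputs by a self-checking checker can be illustrated in software. The sketch below uses the textbook two-rail checker, which is one well-known way to build such a checker; the source does not state which checker design the proposed architecture would use, so the choice of technique, the 8-bit word width, and all function names are assumptions made for illustration only.

```python
from functools import reduce
from typing import Tuple

# A two-rail signal: (0, 1) and (1, 0) are valid code words,
# (0, 0) and (1, 1) indicate an error somewhere upstream.
Rail = Tuple[int, int]


def two_rail_cell(p: Rail, q: Rail) -> Rail:
    """One two-rail checker cell: the output is valid only if both inputs are."""
    (x0, y0), (x1, y1) = p, q
    f = (x0 & y1) | (y0 & x1)
    g = (x0 & x1) | (y0 & y1)
    return f, g


def compare_words(a: int, b: int, width: int = 8) -> Rail:
    """Compare two CPU output words (hypothetical 8-bit width).

    Bit i of word a is paired with the complement of bit i of word b,
    so each pair is a valid two-rail code word exactly when the bits agree.
    """
    pairs = [((a >> i) & 1, ((b >> i) & 1) ^ 1) for i in range(width)]
    return reduce(two_rail_cell, pairs)


def is_error(signal: Rail) -> bool:
    """An error is flagged when the two rails are equal."""
    return signal[0] == signal[1]
```

In this model, matching outputs yield a complementary pair, and a mismatch in any bit forces the rails to become equal. A checker output stuck at (0, 0) or (1, 1) is also read as an error, which is the property that makes the checker self-checking with respect to such faults.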

The memory system would comprise a main memory and a hardware-controlled checkpointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to their other functions, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
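
Single-error-correcting, double-error-detecting (SEC-DED) memory functions are commonly built on an extended Hamming code. The following minimal sketch protects a 4-bit value with an (8,4) extended Hamming code. The word width, the function names, and the status strings are assumptions chosen to keep the example short; the source does not specify the code that the main-memory FPGAs would implement.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity,
    indices 1..7 are Hamming positions (parity at 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1  # makes the total number of 1s even
    return code


def decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); data is None for an uncorrectable double error."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions holding a 1
    overall_odd = sum(code) & 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return None, "double-error"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return nibble, status
```

For example, flipping any one bit of `encode(0b1011)` and calling `decode` recovers the original value with status `"corrected"`, whereas flipping any two bits is reported as `"double-error"` rather than being miscorrected.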

The main purpose served by the memory system as a whole would be to enable the computer to return to a valid state, that is, a known good point reached in the computations before the occurrence of a detected error. In operation, the checkers in the self-checking processor module would signal errors to the memory system. Recovery would involve halting the operation of the self-checking processor module, correcting its configuration bits if necessary, reloading its registers, and returning control to a previous, known good point in the program. The CPUs could then resume correct computations.


The known good point in the computations would be provided by the CPS in a procedure denoted, variously, as checkpointing and checkpoint recovery or checkpoint rollback. The CPS would periodically command each CPU to store the contents of its registers in the recovery cache and clear its caches. This action would establish a checkpoint. Then the original value and the address of any clean RAM block that was subsequently overwritten by the CPU would be stored in a special RAM within the recovery cache. Subsequent writes to that block would be carried out normally (that is, without intervention by the recovery cache). If an error in the CPU were detected, the data in the special recovery-cache RAM could be used to restore the corresponding data in the main memory to their prior correct values, the processor configuration would be reloaded, the caches in the processor module would be cleared, and the processor registers would be restored to their prior values.
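
The checkpoint and rollback procedure described above can be sketched as a small software model. This is an illustration under stated assumptions, not the hardware design: the class name `CheckpointedMemory`, its methods, the use of one integer per RAM block, and the register names in the usage note are all hypothetical, and reloading the FPGA configuration and clearing the CPU caches are outside the scope of the model.

```python
from typing import Dict, List


class CheckpointedMemory:
    """Toy model of main memory plus a recovery cache (names are hypothetical)."""

    def __init__(self, size: int, log_capacity: int) -> None:
        self.ram: List[int] = [0] * size
        self.log_capacity = log_capacity
        # Recovery-cache RAM: address -> value the block held at the checkpoint.
        self.undo_log: Dict[int, int] = {}
        self.saved_regs: Dict[str, int] = {}

    def checkpoint(self, regs: Dict[str, int]) -> None:
        """Save the CPU registers and mark every RAM block as clean again."""
        self.saved_regs = dict(regs)
        self.undo_log.clear()

    def is_full(self) -> bool:
        return len(self.undo_log) >= self.log_capacity

    def write(self, addr: int, value: int) -> None:
        # Only the first write to a clean block is logged; later writes
        # to the same block go straight to RAM.
        if addr not in self.undo_log:
            self.undo_log[addr] = self.ram[addr]
        self.ram[addr] = value

    def rollback(self) -> Dict[str, int]:
        """Restore RAM to its checkpoint contents and return the saved registers."""
        for addr, old_value in self.undo_log.items():
            self.ram[addr] = old_value
        self.undo_log.clear()
        return dict(self.saved_regs)
```

A typical sequence would be `mem.checkpoint({"pc": 100})`, followed by several calls to `mem.write(...)`, and, upon an error signal from the checkers, `regs = mem.rollback()`. Because only the first overwrite of each block is logged, the size of the recovery cache bounds the number of distinct blocks that can be modified between checkpoints, not the number of writes.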

A new checkpoint could be ordered when the recovery cache became filled to capacity. Alternatively, checkpoints could be forced at strategic points in the software. Another alternative would be to force checkpoints periodically, at intervals short enough to ensure that rollback time did not exceed a value that could be specified by design.
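
The three checkpoint triggers just described can be combined in a simple decision function. The sketch below is hypothetical: the function name, its parameters, the priority order among the triggers, and the use of a cycle count as the measure of elapsed time are assumptions, since the source does not say how the CPS would arbitrate among them.

```python
from typing import Optional


def checkpoint_due(
    log_entries: int,
    log_capacity: int,
    cycles_since_checkpoint: int,
    max_interval_cycles: int,
    software_request: bool = False,
) -> Optional[str]:
    """Return the reason a new checkpoint should be taken, or None.

    All parameter names are illustrative assumptions:
      log_entries / log_capacity  -- fill level and size of the recovery cache
      cycles_since_checkpoint     -- time elapsed since the last checkpoint
      max_interval_cycles         -- design limit chosen to bound rollback time
      software_request            -- checkpoint forced at a point in the software
    """
    if software_request:
        return "software"
    if log_entries >= log_capacity:
        return "cache-full"
    if cycles_since_checkpoint >= max_interval_cycles:
        return "interval"
    return None
```

For instance, `checkpoint_due(4, 4, 10, 1000)` returns `"cache-full"`, while `checkpoint_due(1, 4, 10, 1000)` returns `None`. A shorter `max_interval_cycles` reduces the amount of work that could be lost in a rollback at the cost of more frequent register saves and cache clears.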

This work was done by Raphael Some and David Rennels of Caltech for NASA’s Jet Propulsion Laboratory. Further information is contained in a TSP (see page 1).